The influence of semantics in IR using LSI and K-means clustering techniques

Authors

  • Daniel Jiménez
  • Edgardo Ferretti
  • Vicente Vidal
  • Paolo Rosso
  • Carlos F. Enguix
Abstract

In this paper we study the influence of semantics on information retrieval preprocessing. Specifically, we compare the performance achieved with stemming and with semantic lemmatization as preprocessing steps. Three techniques are used in the study: the direct use of a weighted matrix, the SVD technique in the LSI model, and the bisecting spherical k-means clustering technique. Although the results do not seem very promising, we believe they can be improved in the future.

1. BACKGROUND AND MOTIVATION

The Information Retrieval (IR) models used in this work fall within the vector space model, which is part of the classic model. The models actually used are the generalized vector space model and Latent Semantic Indexing (LSI) [1]. These models are based on the well-known term-by-document matrix, which is generally a weighted matrix and rarely a raw frequency matrix [2].

The term-by-document matrix is constructed from a collection of documents, and obtaining it requires preprocessing that collection. There are various preprocessing techniques, each handling one or more aspects, e.g. reducing the number of terms that represent the collection or treating related words. After the collection has been preprocessed, a frequency matrix is constructed, which is usually transformed into a weighted matrix. There are many schemes for weighting a frequency matrix, but a reasonable choice is to use "term frequency" as the term frequency component and "inverse document frequency" as the collection frequency component [14]; this scheme is used in this work.

With the weighted matrix we model the information retrieval system induced by the document collection. We also consider two other models: an LSI model (using the SVD technique) and a clustering model (using the bisecting spherical k-means algorithm [9]). The criterion used to evaluate the experiments was the average precision-recall ratio [1].
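The weighting and LSI steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `tfidf_weight` and `lsi_similarities`, the idf variant log(N / df), the toy frequency matrix, and the choice of k are all assumptions made for illustration. It builds a tf-idf weighted term-by-document matrix, truncates its SVD to rank k, and ranks documents by cosine similarity to a query in the reduced space.

```python
import numpy as np


def tfidf_weight(freq):
    """Turn a term-by-document frequency matrix into a tf-idf weighted matrix."""
    freq = np.asarray(freq, dtype=float)
    n_docs = freq.shape[1]
    # Document frequency: number of documents in which each term occurs.
    doc_freq = np.count_nonzero(freq, axis=1)
    # Assumed idf variant: log(N / df); terms that never occur get idf 0.
    idf = np.log(n_docs / np.maximum(doc_freq, 1))
    idf[doc_freq == 0] = 0.0
    return freq * idf[:, np.newaxis]


def lsi_similarities(weighted, query, k):
    """Cosine similarity between a query and each document in a rank-k LSI space."""
    u, s, vt = np.linalg.svd(weighted, full_matrices=False)
    u_k = u[:, :k]
    # Documents in the latent space: S_k V_k^T (one column per document).
    docs_k = s[:k, np.newaxis] * vt[:k, :]
    # Project the query into the same space with U_k^T q.
    query_k = u_k.T @ np.asarray(query, dtype=float)
    denom = np.linalg.norm(docs_k, axis=0) * np.linalg.norm(query_k)
    denom[denom == 0] = 1.0  # avoid division by zero for empty vectors
    return (query_k @ docs_k) / denom


# Hypothetical toy collection: 4 terms (rows) by 3 documents (columns).
freq = np.array([
    [2, 0, 1],
    [1, 1, 0],
    [0, 3, 0],
    [0, 0, 2],
])
weighted = tfidf_weight(freq)
query = np.array([0, 0, 0, 1])  # query containing only the fourth term
sims = lsi_similarities(weighted, query, k=2)
ranking = np.argsort(-sims)  # document indices, most similar first
print(ranking)
```

Setting k equal to the rank of the weighted matrix gives the same ranking as plain cosine similarity on that matrix, which corresponds to the "direct use of a weighted matrix" baseline; smaller values of k give the LSI model.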


Similar resources

Text Categorization and Information Retrieval Using WordNet Senses

In this paper we study the influence of semantics in the Text Categorization (TC) and Information Retrieval (IR) tasks. The K Nearest Neighbours (K-NN) method was used to perform the text categorization. The experimental results were obtained by taking into account, for each relevant term of a document, its corresponding WordNet synset. For the IR task, three techniques were investigated: the direct us...


Tech. Report: Matrix dimensionality reduction for LSI using Spherical K-means

In this paper, we propose using the Spherical K-means algorithm as a preprocessing step for Latent Semantic Indexing (LSI). LSI is a well-known approach in Information Retrieval (IR). Spherical K-means is a fast clustering algorithm that puts similar documents together, thus forming K clusters. We propose using Spherical K-means to form the matrix of normalized concept vectors yields high reduc...


Comparing k-means clusters on parallel Persian-English corpus

This paper compares clusters of aligned Persian and English texts obtained with the k-means method. Text clustering has many applications in various fields of natural language processing. So far, much research on clustering English documents has been carried out. The question now arises: can its results be extended to other languages? Since the goal of document clustering is grouping of docum...


Extraction and 3D Segmentation of Tumors-Based Unsupervised Clustering Techniques in Medical Images

Introduction: The diagnosis and separation of cancerous tumors in medical images require accuracy, experience, and time, and have always posed a major challenge to radiologists and physicians. Materials and Methods: We received 290 medical images composed of 120 mammographic images in LJPEG format, scanned in gray-scale with 50-micron size, 110 MRI images including T1-weighted, T...


A Hybrid Data Clustering Algorithm Using Modified Krill Herd Algorithm and K-MEANS

Data clustering is the process of partitioning a set of data objects into meaningful clusters or groups. Due to the wide use of clustering algorithms in many fields, much research is still going on to find the best and most efficient clustering algorithm. K-means is simple and easy to implement, but it is sensitive to the initialization of the cluster centers and hence can become trapped in a local optimum. In this paper...



Publication date: 2003